This R Markdown document includes contributions from Professor Bret Larget.
You will need the package tidyverse for these lectures, plus the course scripts COURSE/scripts/viridis.R and COURSE/scripts/ggprob.R. The file ggprob.R contains a number of functions for graphing probability distributions in a tidyverse-friendly manner.
Binomial distributions are discrete distributions that consider the number of “successes” in a fixed series of \(n\) trials.
Example: Flipping a coin 10 times and recording the number of heads (successes) would follow a binomial distribution with \(n = 10\) and probability of success \(p = 0.5\) (assuming the coin is fair!).
Binomial distributions are used when we want to know about the occurrence of an event, not its magnitude.
A binomial random variable satisfies the following properties:
The mean is \(\mu = np\)
The variance is \(\sigma^2 = np(1-p)\)
The standard deviation is \(\sigma = \sqrt{np(1-p)}\)
We will not go over the mathematical derivation of these formulas, but can demonstrate them with a quick simulation.
The function rbinom() may be used to generate random
binomial variables.
We will keep the values in a tibble and then summarize and graph them.
B = 1000000
n = 100
p = 0.5
binomial_example = tibble(
x = rbinom(B, n, p)
)
## use as.data.frame to print more digits
binomial_example %>%
summarize(mean = mean(x),
variance = var(x),
sd = sd(x)) %>%
as.data.frame()
## mean variance sd
## 1 49.99191 24.98402 4.998402
The simulation does not prove that the formulas are correct, but is consistent with the theory.
Try doing the same thing with different numerical values!
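The moment formulas can also be verified exactly, without simulation, by summing over the binomial pmf with dbinom(). This is a quick sketch added for illustration (not part of the original lecture code):

```r
# E(X) = sum_k k * P(X = k) and Var(X) = sum_k (k - mu)^2 * P(X = k)
n = 100
p = 0.5
k = 0:n
mu = sum(k * dbinom(k, n, p))               # matches n*p = 50
sigma2 = sum((k - mu)^2 * dbinom(k, n, p))  # matches n*p*(1-p) = 25
c(mu, sigma2)
```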
We can also make a bar graph of the numbers generated in the simulation.
Using aes(y = after_stat(density)) changes the scale on the y axis to plot as a density instead of as counts.
ggplot(binomial_example, aes(x=x)) +
geom_histogram(aes(y = after_stat(density)),
center = 50, binwidth = 1,
color = "black", fill = "firebrick") +
ylab("Probability") +
ggtitle("Simulated Binomial Distribution",
subtitle = "n = 100, p = 0.5") +
theme_minimal()
We can instead use geom_segment() with ggplot() to plot the observed proportions.
binomial_example %>%
count(x) %>%
ungroup() %>%
mutate(p = n/sum(n)) %>%
ggplot(aes(x=x, y=p, xend=x, yend=0)) +
geom_segment(color = "blue") +
geom_hline(yintercept = 0) +
ylab("Probability") +
ggtitle("Simulated Binomial Distribution",
subtitle = "n = 100, p = 0.5") +
theme_minimal()
Next, we plot many binomial distributions for different values of n and p. We use ggarrange() to arrange the plots and expand_grid(), which creates a data frame with all possible combinations of n_options and p_options. The function pmap() from purrr iterates over the rows of a data frame; .x is the first argument (n_options) and .y is the second argument (p_options).
n_options = 2^seq(0, 10, by = 1)
p_options = c(0.1, 0.5, 0.9)
binomial_plots = expand_grid(n_options, p_options) %>%
pmap(~{
gbinom(.x, .y,
color = if_else(.y == 0.1, "red",
if_else(.y == 0.5, "purple", "blue")),
scale = TRUE) +
theme_minimal()
})
ggarrange(plots = binomial_plots,
ncol = length(p_options),
nrow = length(n_options))
\[ P(X = k) = \binom{n}{k} p^k(1-p)^{n-k}, \qquad \text{for $k=0,1,\ldots,n$} \]
\[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]
The expression \(p^k (1-p)^{n-k}\) is the probability of each single sequence of length \(n\) with exactly \(k\) symbols of one type and \(n-k\) symbols of another.
The probability that there are exactly \(k\) “successes” in a sequence of \(n\) independent trials with success probability \(p\) is the sum of the probabilities of all the ways this could occur.
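To make this reasoning concrete, for a small \(n\) we can enumerate every possible outcome sequence, sum the sequence probabilities by the number of successes, and recover the binomial pmf. This enumeration is a sketch added for illustration (not from the lecture):

```r
# All 2^n sequences of failures (0) and successes (1) for n = 4 trials
n = 4
p = 0.3
outcomes = expand.grid(rep(list(0:1), n))
k = rowSums(outcomes)                 # number of successes in each sequence
seq_prob = p^k * (1 - p)^(n - k)      # probability of each single sequence
pmf = tapply(seq_prob, k, sum)        # sum over all sequences with k successes
pmf
dbinom(0:n, n, p)                     # agrees with the formula
```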
In R, choose() calculates \(\binom{n}{k}\) and factorial() calculates \(n!\). For large arguments it is numerically safer to work on the log scale with lgamma(), where lgamma(x+1) is equal to \(\ln x!\). In practice, we use dbinom() to calculate binomial probabilities without worrying about the formula or numerical instabilities.
n = 5
p = 0.5
k = 2
choose(n, k)
## [1] 10
factorial(n)/(factorial(k)*factorial(n-k)) # n! /(k!(n-k)!)
## [1] 10
choose(n, k)*p^k*(1-p)^(n-k)
## [1] 0.3125
dbinom(k,n,p) # see below
## [1] 0.3125
rbinom(n, size, prob)
dbinom(x, size, prob)
pbinom(q, size, prob)
qbinom(p, size, prob)
Note that with these functions, the parameters for the Binomial
are size and prob (what we’ve been referring
to as \(n\) and \(p\), respectively).
The argument n in rbinom() is the sample size: how many random variables to generate.
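For instance, naming the arguments explicitly makes the distinction clear (a small hypothetical example, added for illustration):

```r
# Six random draws from a Binomial(size = 5, prob = 0.5) distribution;
# rbinom()'s n is the number of draws, not the number of trials
x = rbinom(n = 6, size = 5, prob = 0.5)
x
```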
# Simulate binomial random variables
n = 5
p = 0.5
x = rbinom(6, n, p)
x
## [1] 4 2 3 3 4 3
mean(x) # estimate based on sample
## [1] 3.166667
n*p # true mean
## [1] 2.5
var(x) # estimate based on sample
## [1] 0.5666667
n*p*(1-p) # true variance
## [1] 1.25
## Binomial "density" calculations
dbinom(0:n, n, p)
## [1] 0.03125 0.15625 0.31250 0.31250 0.15625 0.03125
dbinom(4, n, p)
## [1] 0.15625
choose(n, 4)*p^4*(1-p)^(n-4)
## [1] 0.15625
The function gbinom(), used below to plot the distribution, is defined in ggprob.R.
## Binomial "density" plot
gbinom(n,p)
## Binomial distribution calculations
## P(X <= x) = F(x), where F is the distribution function.
# P(X <= 3):
pbinom(3, n, p)
## [1] 0.8125
dbinom(0, n, p) + dbinom(1, n, p) + dbinom(2, n, p) + dbinom(3, n, p)
## [1] 0.8125
1 - dbinom(4, n, p) - dbinom(5, n, p)
## [1] 0.8125
# P(X > 3):
1 - pbinom(3, n, p) # 1 - P(X <= 3)
## [1] 0.1875
pbinom(3, n, p, lower.tail=FALSE) # P(X > 3)
## [1] 0.1875
dbinom(4, n, p) + dbinom(5, n, p)
## [1] 0.1875
# P(X < 3):
pbinom(3, n, p) - dbinom(3, n, p)
## [1] 0.5
pbinom(2, n, p) # P(X <= 2) = P(X < 3)
## [1] 0.5
## Binomial quantile calculations
## Docs: the quantile is defined as the smallest value x such that F(x) >= p,
## where F is the distribution function
qbinom(.2, n, p) # smallest x such that P(X <= x) >= 0.2
## [1] 2
dbinom(0, n, p) + dbinom(1, n, p) + dbinom(2, n, p)
## [1] 0.5
dbinom(0, n, p) + dbinom(1, n, p)
## [1] 0.1875
What are the mean and standard deviation of the \(\text{Binomial}(90, 0.7)\) distribution?
The mean is \(\mu = (90)(0.7) = 63\).
The standard deviation is \(\sigma = \sqrt{90(0.7)(0.3)} = 4.347413\).
Plot the \(\text{Binomial}(90, 0.7)\) distribution. Add a red partly transparent solid vertical line at the mean, dashed vertical lines one standard deviation above and below the mean, and dotted lines two standard deviations above and below the mean.
n = 90
p = 0.7
mu = n*p
sigma = sqrt(n*p*(1-p))
gbinom(n, p, scale = TRUE) +
geom_vline(xintercept = mu, color = "red", alpha = 0.5) +
geom_vline(xintercept = mu + c(-1,1)*sigma,
color = "red", linetype = "dashed") +
geom_vline(xintercept = mu + c(-2,2)*sigma,
color = "red", linetype = "dotted") +
theme_minimal()
If \(X \sim \text{Binomial}(90, 0.7)\), what is \(P(X = \mu)\)?
dbinom(mu, n, p)
## [1] 0.09144637
If \(X \sim \text{Binomial}(90, 0.7)\), what is \(P(\mu - \sigma \leq X \leq \mu + \sigma)\)?
## endpoints
mu + c(-1,1)*sigma
## [1] 58.65259 67.34741
## Within one standard deviation
## P(mu - sigma <= X <= mu + sigma)
## P(59 <= X <= 67) = P(X <= 67) - P(X <= 58)
pbinom(67, n, p) - pbinom(58, n, p)
## [1] 0.6995895
## Sum of probabilities
sum( dbinom(59:67, n, p) )
## [1] 0.6995895
Find the 0.05 and 0.95 quantiles of the \(\text{Binomial}(90, 0.7)\) distribution.
qbinom(c(0.05, 0.95), n, p)
## [1] 56 70
## Compare to the tail probabilities
pbinom(56, n, p) # P(X <= 56) >= 0.05
## [1] 0.06954663
pbinom(55, n, p) # P(X <= 55) < 0.05
## [1] 0.04457106
## Compare to the tail probabilities
pbinom(70, n, p) # P(X <= 70) >= 0.95
## [1] 0.9609796
pbinom(69, n, p) # P(X <= 69) < 0.95
## [1] 0.9354706
Explain why each of the following random variables does not have a binomial distribution.
\(X_1\) is the number of times a fair coin is tossed until we have observed at least one head and one tail.
There is not a fixed number of trials.
\(X_2\) is the number of red balls drawn when one is picked from a bucket with 10% red balls, a second is picked from a bucket with 20% red balls, and a third is drawn from a bucket with 30% red balls.
The value of \(p\) is not the same for all trials.
\(X_3\) is the number of red balls selected in a sample of 5 chosen without replacement from a bucket with ten red balls and ten white balls.
The trials are not independent. The color of the first ball selected affects the probability of the color of the second ball, for example.
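In fact, \(X_3\) follows a hypergeometric distribution. As an aside (not in the original solution), base R's dhyper() gives its exact pmf, which differs from the binomial pmf that ignores the dependence:

```r
# X3: number of red balls in 5 draws without replacement
# from an urn with 10 red and 10 white balls
k = 0:5
dhyper(k, 10, 10, 5)   # exact hypergeometric pmf
dbinom(k, 5, 0.5)      # binomial pmf assuming (wrongly) independent draws
```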
Make a line plot of the 0.75 quantile of binomial distributions with \(p=0.7\) and \(n\) varying from 1 to 500 with \(n\) on the x axis and the quantile on the y axis. Describe the shape of the relationship between these variables.
prob7 = tibble(
n = 1:500,
q75 = qbinom(0.75, n, 0.7))
ggplot(prob7, aes(x=n, y=q75)) +
geom_line()
If \(q_{0.75}(n)\) is the 0.75 quantile of the binomial distribution with \(p = 0.7\), define \(z = (q_{0.75}(n) - np)/\sqrt{np(1-p)}\). (Subtract the binomial mean and divide by the binomial standard deviation.) Make a line plot of \(z\) on the y axis and \(n\) varying from 1 to 500 on the x axis. Add a straight regression line to the plot. Describe the plot.
prob7 = prob7 %>%
mutate(mu = n*0.7,
sigma = sqrt(n*0.7*0.3),
z = (q75 - mu)/sigma)
ggplot(prob7, aes(x=n, y = z)) +
geom_line() +
geom_smooth(method = "lm")
prob7 %>%
summarize(z = mean(z))
## # A tibble: 1 × 1
## z
## <dbl>
## 1 0.683
Normal distributions are continuous distributions
All normal distributions have the same overall shape: symmetric, unimodal, bell-shaped
The shape of the normal density curve is completely determined by its mean (\(\mu\)) and its standard deviation (\(\sigma\))
The density function has the following formula
\[f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\]
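As a quick check (added here for illustration), evaluating this formula directly agrees with R's dnorm():

```r
# Compare the density formula with dnorm() at several x values
mu = 3
sigma = 0.5
x = seq(1, 5, by = 0.5)
f = 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x - mu)/sigma)^2)
max(abs(f - dnorm(x, mu, sigma)))   # essentially zero
```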
mu = 3
sigma = 0.5
mu2 = 7
sigma2 = 2
gnorm(mu=mu, sigma=sigma, a=0, b=12, color="blue",
fill=NULL, title=FALSE) +
geom_norm_density(mu = mu2, sigma = sigma2, color = "magenta") +
theme_minimal()
The functions gnorm() and geom_norm_density() are defined in the file ggprob.R and are not part of the tidyverse.
We are often interested in probabilities of the form
\[P(a \leq X \leq b)\]
This is the area under the density curve between \(a\) and \(b\).
The function pnorm() calculates areas under normal
curves.
The arguments in order are:
q: a location on the x axis
mean: the mean of the distribution, \(\mu\); default value is 0
sd: the standard deviation of the distribution, \(\sigma\); default value is 1
lower.tail: logical to calculate the area to the left; default value is TRUE
log.p: logical to calculate the log of the probability; default value is FALSE
## naming arguments; I usually do not
pnorm(q=80, mean=100, sd=25)
## [1] 0.2118554
## without the names
pnorm(80, 100, 25)
## [1] 0.2118554
## gnorm is defined in ggprob.R
## you need to call source("../../scripts/ggprob.R") in set-up for this to work
## a is the left endpoint of the interval to fill
## b is the right endpoint of the interval to fill
## use NULL for the default value (all the way to the left or right)
gnorm(mu = 100, sigma = 25) +
geom_norm_fill(mu = 100, sigma=25, a = NULL, b = 80) +
theme_minimal()
## using lower.tail
pnorm(90, 100, 25, lower.tail = FALSE)
## [1] 0.6554217
## using subtraction
1 - pnorm(90, 100, 25)
## [1] 0.6554217
## first argument is mu, second is sigma if you do not want to specify names
## Need to repeat mu and sigma in each level: not set in aes()
gnorm(100, 25) +
geom_norm_fill(100, 25, a = 90, b = NULL) +
theme_minimal()
## use subtraction
pnorm(120, 100, 25) - pnorm(50, 100, 25)
## [1] 0.7653945
gnorm(100, 25) +
geom_norm_fill(100, 25, a = 50, b = 120) +
theme_minimal()
## using lower.tail
pnorm(70, 100, 25) + pnorm(130, 100, 25, lower.tail = FALSE)
## [1] 0.2301393
## using subtraction
pnorm(70, 100, 25) + (1 - pnorm(130, 100, 25))
## [1] 0.2301393
gnorm(100, 25) +
geom_norm_fill(100, 25, a = NULL, b = 70) +
geom_norm_fill(100, 25, a = 130, b = NULL) +
theme_minimal()
gnorm() +
theme_minimal()
If \(X \sim \text{N}(\mu, \sigma)\), then \(Z = \left(\frac{X-\mu}{\sigma}\right) \sim \text{N}(0, 1)\)
An implication is \(\mathsf{P}(X < a) = \mathsf{P}\left( Z < \frac{a-\mu}{\sigma}\right)\)
\(X \sim \text{N}(100, 25)\)
mu = 100
sigma = 25
x = 80
z = (x-mu)/25
## Using X directly
pnorm(x, mu, sigma)
## [1] 0.2118554
## Standardizing first
## default values for the mean and sd are 0 and 1
pnorm(z)
## [1] 0.2118554
gnorm(mu, sigma) +
geom_norm_fill(mu, sigma, b=x) +
theme_minimal()
gnorm() +
geom_norm_fill(b=z) +
theme_minimal()
The p-quantile is the location q where the area to
the left of q under the normal density is equal to
p.
The R function qnorm() makes these
calculations.
Arguments in order are:
p: the probability of the quantile
mean: mean of the distribution, \(\mu\); default is 0
sd: standard deviation of the distribution, \(\sigma\); default is 1
lower.tail: default is TRUE; if FALSE, find locations for areas to the right
log.p: default is FALSE; if TRUE, p is interpreted as the natural log of the probability
Example: \(X \sim \text{N}(100, 25)\)
Find the 0.1, 0.25, 0.9, 0.95, 0.975, and 0.99 quantiles of the distribution.
p = c(0.1, 0.25, 0.9, 0.95, 0.975, 0.99)
mu = 100
sigma = 25
qnorm(p, mu, sigma)
## [1] 67.96121 83.13776 132.03879 141.12134 148.99910 158.15870
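Because the normal distribution is continuous, qnorm() exactly inverts pnorm(); a quick sketch of the check (added for illustration):

```r
# pnorm() applied to the quantiles recovers the original probabilities
p = c(0.1, 0.25, 0.9, 0.95, 0.975, 0.99)
q = qnorm(p, 100, 25)
pnorm(q, 100, 25)
```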
## use ggarrange() to display multiple plots and map() to create a list of plots
normal_plots = p %>%
map(~{
gnorm(mu, sigma) +
geom_norm_fill(mu, sigma, a = NULL, b = qnorm(.x, mu, sigma)) +
theme_minimal()
})
ggarrange(plots = normal_plots,
nrow = length(p),
ncol = 1)
The weights of packets of cookies produced by a certain manufacturer have a normal distribution with a mean of 202 g and a standard deviation of 3 g. What is the weight that should be labeled on the packet so that only 1% of the packets are underweight?
## This gives the 1st percentile of a N(202, 3) distribution
qnorm(0.01, mean=202, sd=3, lower.tail=TRUE)
## [1] 195.021
## Check
pnorm(195.021, mean=202, sd=3) # P(X <= 195.021) = .01
## [1] 0.01000039
## Assumptions satisfied
n = 100
p = 0.5
mu = n*p
sigma = sqrt(n*p*(1-p))
gbinom(n, p, scale=TRUE) +
geom_norm_density(mu, sigma, color = "red") +
theme_minimal()
## Criteria not satisfied
n = 100
p = 0.01
mu = n*p
sigma = sqrt(n*p*(1-p))
gbinom(n, p, scale=TRUE) +
geom_norm_density(mu, sigma, color = "red") +
theme_minimal()
If we want to calculate a binomial probability, we should just do so directly.
However, if we do want to use a normal approximation, the calculation will be more accurate if we add or subtract 0.5 as appropriate (a continuity correction).
The exact binomial probability that \(X=x\) is approximated by the area under a normal density curve between \(x-0.5\) and \(x+0.5\).
So, for example, when \(n = 20\) and \(p = 0.4\), we can compare the exact binomial probability \(\mathsf{P}(X \le 6)\) with several normal approximations.
## Exact calculation
pbinom(6, 20, 0.4)
## [1] 0.2500107
## Normal approximations without correction
## moments
mu = 20*0.4
sigma = sqrt(20*0.4*0.6)
## approximations
pnorm(6, mu, sigma)
## [1] 0.1806552
pnorm(7, mu, sigma)
## [1] 0.3240384
pnorm(6.5, mu, sigma)
## [1] 0.2467814
cc_example = tibble(
x = 0:20,
p = dbinom(x, 20, 0.4)
)
ggplot(cc_example, aes(x = x, y = p)) +
geom_col(fill = "gray") +
geom_norm_fill(mu = mu, sigma = sigma, b = 6, alpha = 0.5) +
geom_norm_density(mu, sigma, color = "blue") +
geom_hline(yintercept = 0) +
theme_minimal()
ggplot(cc_example, aes(x = x, y = p)) +
geom_col(fill = "gray") +
geom_norm_fill(mu = mu, sigma = sigma, b = 7, alpha = 0.5) +
geom_norm_density(mu, sigma, color = "blue") +
geom_hline(yintercept = 0) +
theme_minimal()
ggplot(cc_example, aes(x = x, y = p)) +
geom_col(fill = "gray") +
geom_norm_fill(mu = mu, sigma = sigma, b = 6.5, alpha = 0.5) +
geom_norm_density(mu, sigma, color = "blue") +
geom_hline(yintercept = 0) +
theme_minimal()
You will need the package tidyverse for this file
Also, install the broman package for the
myround() function which actually rounds output to the
requested number of digits (as a string).
This lecture uses the following scripts, assumed to be in your course scripts directory: COURSE/scripts/viridis.R and COURSE/scripts/ggprob.R.
The file ggprob.R contains a number of functions for graphing probability distributions in a tidyverse-friendly manner.
Suppose \(X \sim \text{N}(250, 30)\) and we want \(P(X > 300)\). The function pnorm() can give us a numerical answer.
1 - pnorm(300, 250, 30)
## [1] 0.04779035
gnorm(250, 30) +
geom_norm_fill(250, 30, a = 300)
Now, let’s say we forgot about the existence of
pnorm(), but we did know how to use rnorm() to
sample random variables.
The idea would be to sample many random variables with this distribution and simply count the proportion that are larger than 300.
First, we do a tidyverse solution where we create a data frame with the random sample in a column and then use dplyr commands to calculate the desired summary.
## setting the seed here so that we get the same values each time we knit
## We will only set the seed once in this file
set.seed(2022)
B = 1000000
prob1 = tibble(
x1 = rnorm(B, 250, 30),
x2 = rnorm(B, 250, 30),
x3 = rnorm(B, 250, 30),
x4 = rnorm(B, 250, 30))
prob1_prob = prob1 %>%
summarize(p1 = mean(x1 > 300),
p2 = mean(x2 > 300),
p3 = mean(x3 > 300),
p4 = mean(x4 > 300))
prob1_prob
## # A tibble: 1 × 4
## p1 p2 p3 p4
## <dbl> <dbl> <dbl> <dbl>
## 1 0.0477 0.0478 0.0480 0.0479
round(1 - pnorm(300, 250, 30), 4)
## [1] 0.0478
## A base R solution
prob1_base = mean( rnorm(B, 250, 30) > 300 )
round(prob1_base, 4)
## [1] 0.0478
Next, let’s try another example where we have theory to tell us the exact number.
Let \(X_2 \sim \text{Binomial}(500, 0.2)\) so the mean is \(\mathsf{E}(X_2) = 500 \times 0.2 = 100\) and the standard deviation is \(\sigma = \sqrt{500(0.2)(0.8)} \doteq 8.94\).
Find \(\mathsf{P}(X_2 \ge 120)\).
As a review, let’s do the calculation exactly first with
pbinom().
## Notice P(X >= 120) = 1 - P(X <= 119)
prob2 = 1 - pbinom(119, 500, 0.2)
prob2
## [1] 0.0160918
B = 1000000
prob2_df = tibble(
x = rbinom(B, 500, 0.2)
)
prob2_sim = prob2_df %>%
summarize(prob2 = mean(x >= 120)) %>%
pull(prob2)
prob2_sim
## [1] 0.016149
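For comparison (this calculation is an added sketch, not part of the original solution), the normal approximation with a continuity correction comes reasonably close to the exact value:

```r
# Approximate P(X >= 120) by the area above 119.5 under N(100, sqrt(80))
mu = 500*0.2
sigma = sqrt(500*0.2*0.8)
1 - pnorm(119.5, mu, sigma)   # compare with the exact 0.0161
```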
Here is an example that delves into random sampling.
Suppose that \(X_1, X_2, \ldots, X_{10}\) each have a \(\text{Normal}(100, 20)\) distribution and that they are mutually independent.
Let \(\bar{X}\) be the sample mean, \(\bar{X} = (X_1 + \cdots + X_{10})/10\).
Find \(\mathsf{P}(95 < \bar{X} < 105)\).
We may write this set-up as \(X_i \overset{\text{iid}}{\sim} \text{Normal}(100, 20), \text{ for } i = 1,\ldots,10\).
Here we use map_dbl(), which is from the tidyverse package purrr and is used for iteration. The suffix _dbl specifies that the return value is a vector of numerical values (double precision floating point values). Each iteration computes mean(rnorm(10, 100, 20)).
B = 1000000
prob3_df = tibble(
xbar = map_dbl(1:B, ~mean(rnorm(10, 100, 20))))
prob3 = prob3_df %>%
summarize(p = mean( between(xbar, 95, 105))) %>%
pull(p)
prob3
## [1] 0.571454
prob3_sum = prob3_df %>%
summarize(mu = mean(xbar),
sigma = sd(xbar))
mu = prob3_sum %>%
pull(mu)
sigma = prob3_sum %>%
pull(sigma)
mu
## [1] 100.0044
sigma
## [1] 6.322996
ggplot(prob3_df, aes(x = xbar)) +
geom_density(color = "blue") +
geom_norm_density(mu, sigma, color = "red") +
geom_norm_fill(mu, sigma, a = 95, b = 105, fill = "red") +
geom_hline(yintercept = 0) +
xlab("Sample Mean") +
ylab("Density") +
ggtitle("Example 3: normal sample mean")
Notice that the estimated density curve is nearly a perfect match for the theoretical normal density with these moments.
This is not a coincidence: the sample mean of iid normal random variables is itself exactly normal, with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).
For our numerical example, \(\text{SD}(\bar{X}) = 20 / \sqrt{10} \doteq 6.325\) which is very close to our observed sample standard deviation of 6.323.
If we had used theory and not simulation, we could have calculated the probability like this.
pnorm(105, 100, 20 / sqrt(10)) - pnorm(95, 100, 20 / sqrt(10))
## [1] 0.5708047
unif_0_10 = tibble(x = c(-1, 0, 10, 11),
y = c(0, 0.1, 0.1, 0))
ggplot(unif_0_10, aes(x=x, y=y)) +
geom_step(color = "blue") +
geom_hline(yintercept = 0) +
xlab("U") +
ylab("Density") +
ggtitle("Uniform(0,10) Distribution")
B = 1000000
prob4_df = tibble(
s20 = map_dbl(1:B, ~sum(runif(20, 0, 10))))
p4 = prob4_df %>%
summarize(p = mean(s20 > 115)) %>%
pull(p)
p4
## [1] 0.123557
prob4_sum = prob4_df %>%
summarize(mu = mean(s20),
sigma = sd(s20),
p = mean(s20 > 115))
prob4_sum
## # A tibble: 1 × 3
## mu sigma p
## <dbl> <dbl> <dbl>
## 1 100. 12.9 0.124
mu = prob4_sum$mu
sigma = prob4_sum$sigma
mu
## [1] 99.99506
sigma
## [1] 12.91052
ggplot(prob4_df, aes(x = s20)) +
geom_density(color = "blue") +
geom_norm_density(mu, sigma, color = "red") +
geom_norm_fill(mu, sigma, a = 115, fill = "red") +
geom_hline(yintercept = 0) +
xlab("Sum") +
ylab("Density") +
ggtitle("Sum of Uniforms and Normal Approximation")
1 - pnorm(115, 100, sqrt(2000/12))
## [1] 0.1226391
The probability distribution of a sample mean of size \(n\) from an iid random sample from a distribution with mean \(\mu\) and standard deviation \(\sigma\) has mean equal to \(\mu\), standard deviation equal to \(\sigma/\sqrt{n}\). Furthermore, if \(n\) is sufficiently large, this distribution of the sample mean is approximately normal.
Here is an example where the normal approximation fails, but simulation can still obtain an accurate value.
The gamma distribution is the distribution of a positive continuous random variable with a density function equal to \(f(x) = C x^{\alpha-1} \mathrm{e}^{-\lambda x}, \ x > 0\) where \(\alpha>0\) and \(\lambda>0\) are positive parameters which determine the shape and scale of the distribution, respectively, and \(C\) is a constant (depending on the values of \(\alpha\) and \(\lambda\), but not \(x\)) so that the total area under the density curve equals one.
When \(\alpha\) is small, the gamma density is strongly skewed to the right.
The mean of the distribution is \(\mu = \alpha / \lambda\) and the variance is \(\sigma^2 = \alpha / \lambda^2\) so that the standard deviation is \(\sigma = \sqrt{\alpha}/\lambda\). Here is a graph of the density when \(\alpha = 0.9\) and \(\lambda = 0.09\). The mean is \(\mu = 0.9 / 0.09 = 10\) and the standard deviation is \(\sigma = \sqrt{0.9}/0.09 \doteq 10.54\).
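These moment formulas can be checked numerically (a sketch added for illustration) with base R's integrate() and dgamma(), which is parameterized by shape \(\alpha\) and rate \(\lambda\):

```r
# Numerically verify the mean and sd of Gamma(alpha = 0.9, lambda = 0.09)
alpha = 0.9
lambda = 0.09
m = integrate(function(x) x * dgamma(x, alpha, lambda), 0, Inf)$value
m        # close to alpha/lambda = 10
v = integrate(function(x) (x - m)^2 * dgamma(x, alpha, lambda), 0, Inf)$value
sqrt(v)  # close to sqrt(alpha)/lambda, about 10.54
```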
We use the rgamma() function to generate random gamma variables and use map_dbl() as earlier to take sample means.
B = 1000000
prob5_df = tibble(
xbar = map_dbl(1:B, ~mean(rgamma(4, 0.9, 0.09))))
alpha = 0.9
lambda = 0.09
mu = alpha/lambda
sigma = sqrt(alpha) / lambda
ggplot(prob5_df, aes(x=xbar)) +
geom_density(color = "blue") +
geom_norm_density(mu, sigma/2, color = "red") +
geom_norm_fill(mu, sigma/2, a = 20, fill = "red") +
geom_hline(yintercept = 0) +
xlab("Sample Mean") +
ylab("Density") +
ggtitle("Skewed Gamma Distribution Sample Mean, n=4")
prob5 = prob5_df %>%
summarize(p = mean(xbar > 20))
prob5
## # A tibble: 1 × 1
## p
## <dbl>
## 1 0.0491
mu
## [1] 10
sigma
## [1] 10.54093
1 - pnorm(20, mu, sigma/sqrt(4))
## [1] 0.02888979
## Exact: the mean of 4 iid Gamma(alpha, lambda) variables has a Gamma(4*alpha, 4*lambda) distribution
prob5_exact = 1 - pgamma(20, 4*alpha, 4*lambda)
Simulation can be very accurate without any knowledge of the theory to make exact calculations and without knowing the true mean and standard deviation.
When the sample size \(n\) is large enough for a given problem, then a simple area under a normal curve can be very accurate.
B = 1000000
alpha = 9
lambda = 0.9
mu = alpha/lambda
sigma = sqrt(alpha) / lambda
prob6_df = tibble(
xbar = map_dbl(1:B, ~mean(rgamma(4, alpha, lambda))))
ggplot(prob6_df, aes(x=xbar)) +
geom_norm_density(mu, sigma/2, color = "red") +
geom_norm_fill(mu, sigma/2, a = 12, fill = "red") +
geom_density(color = "blue") +
geom_hline(yintercept = 0) +
xlab("Sample Mean") +
ylab("Density") +
ggtitle("More Symmetric Gamma Distribution, Sample Mean, n=4")
prob6 = prob6_df %>%
summarize(simulation = mean(xbar > 12),
clt = 1 - pnorm(12, mu, sigma/sqrt(4)),
exact = 1 - pgamma(12, 4*alpha, 4*lambda))
prob6
## # A tibble: 1 × 3
## simulation clt exact
## <dbl> <dbl> <dbl>
## 1 0.119 0.115 0.118